proper noun
NegRefine: Refining Negative Label-Based Zero-Shot OOD Detection
Ansari, Amirhossein, Wang, Ke, Xiong, Pulei
Recent advancements in Vision-Language Models like CLIP have enabled zero-shot OOD detection by leveraging both image and textual label information. Among these, negative label-based methods such as NegLabel and CSP have shown promising results by utilizing a lexicon of words to define negative labels for distinguishing OOD samples. However, these methods can misclassify in-distribution samples as OOD when negative labels are subcategories of in-distribution labels or proper nouns. They also struggle with images that match multiple in-distribution and negative labels. We propose NegRefine, a novel negative label refinement framework for zero-shot OOD detection. By introducing a filtering mechanism that excludes subcategory labels and proper nouns from the negative label set, and by incorporating a multi-matching-aware scoring function that dynamically adjusts the contributions of the multiple labels matching an image, NegRefine ensures a more robust separation between in-distribution and OOD samples. We evaluate NegRefine on large-scale benchmarks, including ImageNet-1K.
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > Costa Rica (0.04)
- North America > Canada (0.04)
- (2 more...)
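Negative label-based scoring of the kind described above can be sketched with cosine similarities and a softmax over the combined in-distribution (ID) and negative label sets. The embeddings, temperature, and labels below are made up for illustration; this is not NegRefine's actual scoring function.

```python
import numpy as np

def negative_label_score(image_feat, id_feats, neg_feats, temp=0.01):
    """Share of softmax mass falling on ID labels when the image is
    scored against ID and negative labels together. High values
    suggest the image is in-distribution."""
    sims = np.concatenate([id_feats, neg_feats]) @ image_feat  # cosine sims
    logits = sims / temp
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs[: len(id_feats)].sum())

# Toy unit-vector embeddings: two ID labels, one negative label.
id_feats = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
neg_feats = np.array([[0.0, 0.0, 1.0]])
```

An image embedding close to an ID label yields a score near 1; one close to a negative label yields a score near 0.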
Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset
Bondok, Rawan, Nassar, Mayar, Khalifa, Salam, Micallef, Kurt, Habash, Nizar
Proper nouns in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper nouns of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. Achieving 73% accuracy, our results underscore both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper noun diacritization.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
- (17 more...)
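The 73% figure above is an exact-match accuracy; the paper's precise scoring protocol isn't given here, but a plausible minimal metric counts a prediction as correct only when the fully diacritized form matches the reference exactly:

```python
def diacritization_accuracy(predictions, references):
    """Exact-match accuracy over (prediction, reference) string pairs."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)
```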
Extracting domain-specific terms using contextual word embeddings
Repar, Andraž, Lavrač, Nada, Pollak, Senja
Automated terminology extraction refers to the task of extracting meaningful terms from domain-specific texts. This paper proposes a novel machine learning approach to terminology extraction, which combines features from traditional term extraction systems with novel contextual features derived from contextual word embeddings. Instead of using a predefined list of part-of-speech patterns, we first analyse a new term-annotated corpus RSDO5 for the Slovenian language and devise a set of rules for term candidate selection, and then generate statistical, linguistic and context-based features. We use a support-vector machine algorithm to train a classification model, evaluate it on the four domains (biomechanics, linguistics, chemistry, veterinary) of the RSDO5 corpus and compare the results with state-of-the-art term extraction approaches for the Slovenian language. Our approach provides significant improvements in terms of F1 score over the previous state of the art, which proves that contextual word embeddings are valuable for improving term extraction.

1. Introduction

Automated terminology extraction (ATE) refers to the task of extracting meaningful terms from domain-specific texts. Terms are single-word (SWU) or multi-word units (MWU) of knowledge, which are relevant for a particular domain. Since manual identification of terms is costly and time-consuming, ATE approaches can reduce the effort needed to generate relevant domain-specific terms. Recognizing and extracting domain-specific terms, which is useful in various fields such as translation, dictionary creation, ontology generation and others, remains a difficult task.
- Europe > Slovenia > Gorizia > Municipality of Vipava > Vipava (0.04)
- Europe > Slovenia > Gorizia > Municipality of Nova Gorica > Nova Gorica (0.04)
- Europe > Slovenia > Central Slovenia > Municipality of Ljubljana > Ljubljana (0.04)
- Asia > Malaysia (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Support Vector Machines (0.69)
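The feature-combination idea above can be sketched with scikit-learn; the feature names, dimensions, and synthetic labels below are illustrative stand-ins, not the RSDO5 setup.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical per-candidate features: a few traditional statistical/linguistic
# scores (e.g. frequency, termhood) plus a contextual-embedding summary.
stat_feats = rng.normal(size=(60, 3))
ctx_feats = rng.normal(size=(60, 5))
X = np.hstack([stat_feats, ctx_feats])        # one row per term candidate
y = (X[:, 0] + X[:, 3] > 0).astype(int)       # synthetic term / non-term labels

clf = SVC(kernel="rbf").fit(X, y)             # binary term-candidate classifier
```

Concatenating both feature families lets the SVM weigh distributional context against classic termhood statistics instead of relying on either alone.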
METEOR: Evolutionary Journey of Large Language Models from Guidance to Self-Growth
Li, Jiawei, Xu, Xiaoang, Gao, Yang
Model evolution enables learning from feedback to refine experiences and update skills, transforming models from having no domain knowledge to becoming domain experts. However, there is currently no unified and effective method for guiding this evolutionary process. To address this gap, we propose the Meteor method, which includes three training phases: weak-to-strong data distillation, iterative training, and self-evolution strategies. Each phase maximizes the model's inherent domain capabilities, allowing it to autonomously refine its domain knowledge and enhance performance. Experiments demonstrate that our approach significantly improves accuracy, completeness, relevance, coherence, and reliability across domain-specific tasks.
- Europe > Austria > Vienna (0.14)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.05)
- Asia > Singapore (0.04)
- (8 more...)
- Overview (0.93)
- Research Report > New Finding (0.46)
DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory
Wang, Yutong, Zeng, Jiali, Liu, Xuebo, Wong, Derek F., Meng, Fandong, Zhou, Jie, Zhang, Min
Large language models (LLMs) have achieved reasonable quality improvements in machine translation (MT). However, most current research on MT-LLMs still faces significant challenges in maintaining translation consistency and accuracy when processing entire documents. In this paper, we introduce DelTA, a Document-levEL Translation Agent designed to overcome these limitations. DelTA features a multi-level memory structure that stores information across various granularities and spans, including Proper Noun Records, Bilingual Summary, Long-Term Memory, and Short-Term Memory, which are continuously retrieved and updated by auxiliary LLM-based components. Experimental results indicate that DelTA significantly outperforms strong baselines in terms of translation consistency and quality across four open/closed-source LLMs and two representative document translation datasets, achieving an increase in consistency scores by up to 4.58 percentage points and in COMET scores by up to 3.16 points on average. DelTA employs a sentence-by-sentence translation strategy, ensuring no sentence omissions and offering a memory-efficient solution compared to the mainstream method. Furthermore, DelTA improves pronoun translation accuracy, and the summary component of the agent also shows promise as a tool for query-based summarization tasks. We release our code and data at https://github.com/YutongWang1216/DocMTAgent.
- Asia > Singapore (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- (16 more...)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
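The memory levels named in the DelTA abstract can be caricatured as a small container; the update policies below (first-seen proper-noun translations, a sliding short-term window) are simplified assumptions, not DelTA's actual components.

```python
from dataclasses import dataclass, field

@dataclass
class MultiLevelMemory:
    proper_nouns: dict = field(default_factory=dict)  # source form -> chosen translation
    bilingual_summary: str = ""
    long_term: list = field(default_factory=list)
    short_term: list = field(default_factory=list)

    def record_proper_noun(self, source, target):
        # First translation wins, so later sentences reuse it consistently.
        self.proper_nouns.setdefault(source, target)

    def push_sentence(self, src, tgt, window=5):
        # Keep the last `window` sentence pairs short-term; spill the rest.
        self.short_term.append((src, tgt))
        if len(self.short_term) > window:
            self.long_term.append(self.short_term.pop(0))
```

Keeping proper nouns in their own record is what enforces the consistency the abstract measures: every later sentence looks up the same translation instead of re-deciding it.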
Annotation Guidelines for Corpus Novelties: Part 1 -- Named Entity Recognition
Amalvy, Arthur, Labatut, Vincent
The Novelties corpus was constituted mainly to fulfill two goals: in the short term, to train and test NER methods able to handle long texts, and in the longer term, to be used to develop Renard [3], a pipeline aiming at extracting character networks from literary fiction. This pipeline includes several processing steps after the NER, including coreference resolution and character unification. Character networks can be used to tackle a number of tasks, including assessing literary theories, estimating the level of historicity of a narrative, detecting roles in stories, classifying novels, identifying subplots, segmenting a storyline, summarizing a story, designing recommendation systems, aligning narratives, and more. See the detailed survey of Labatut and Bost [11] for more information regarding character networks. This context drives the elaboration of the corpus, which explains why it exhibits certain differences with many similar NER corpora, such as CoNLL-2003 [17] or OntoNotes v5 [20]. We originally based Novelties on the literary corpus from Dekker et al. [6], as we describe in Section A of the appendix. Note that there are other literary NER corpora (cf.
- Europe > United Kingdom > England (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > New York (0.04)
- (6 more...)
The Self-Contained Negation Test Set
Kletz, David, Amsili, Pascal, Candito, Marie
Several methodologies have recently been proposed to evaluate the ability of Pretrained Language Models (PLMs) to interpret negation. In this article, we build on Gubelmann and Handschuh (2022), which studies the modification of PLMs' predictions as a function of the polarity of inputs, in English. Crucially, this test uses ``self-contained'' inputs ending with a masked position: depending on the polarity of a verb in the input, a particular token is either semantically ruled out or allowed at the masked position. By replicating Gubelmann and Handschuh (2022) experiments, we have uncovered flaws that weaken the conclusions that can be drawn from this test. We thus propose an improved version, the Self-Contained Neg Test, which is more controlled, more systematic, and entirely based on examples forming minimal pairs varying only in the presence or absence of verbal negation in English. When applying our test to the roberta and bert base and large models, we show that only roberta-large shows trends that match the expectations, while bert-base is mostly insensitive to negation. For all the tested models though, in a significant number of test instances the top-1 prediction remains the token that is semantically forbidden by the context, which shows how much room for improvement remains for a proper treatment of the negation phenomenon.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Pennsylvania (0.04)
- Europe > Italy > Tuscany > Florence (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
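The test design above (self-contained minimal pairs ending in a masked position) can be sketched as a tiny harness; the template and the forbidden-token check below are illustrative, not the authors' actual materials.

```python
def minimal_pair(prefix, verb, suffix):
    """Two inputs differing only in the presence of verbal negation,
    each ending at a masked position."""
    pos = f"{prefix} did {verb} {suffix} [MASK]."
    neg = f"{prefix} did not {verb} {suffix} [MASK]."
    return pos, neg

def forbidden_top1_rate(top1_predictions, forbidden_tokens):
    """Share of instances whose top-1 prediction is the token the
    context semantically rules out (lower is better)."""
    hits = sum(p == f for p, f in zip(top1_predictions, forbidden_tokens))
    return hits / len(forbidden_tokens)
```

A negation-sensitive model should almost never place the forbidden token at top-1 in the negated member of a pair; the abstract reports that current PLMs still do so in a significant number of instances.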
Development of an AI Anti-Bullying System Using Large Language Model Key Topic Detection
Tassava, Matthew, Kolodjski, Cameron, Milbrath, Jordan, Bishop, Adorah, Flanders, Nathan, Fetsch, Robbie, Hanson, Danielle, Straub, Jeremy
Cyberbullying has become a pronounced problem due to the increasing ubiquity of online platforms that provide a means to conduct it. A significant amount of this cyberbullying is conducted by and targets teenagers. It is difficult for teenage students to shut themselves off from the digital world in which the cyberbullying is taking place. Given how entrenched the use of digital apps is among today's youth, and the pronounced consequences of cyberbullying, including victim self-harm in some cases, it is at least as much of a threat as physical bullying. Additionally, because of the obfuscation caused by the online environment, authorities (such as parents, teachers and law enforcement) may have difficulty determining what has occurred and who the participating actors are.
- Africa (0.04)
- Oceania > New Zealand (0.04)
- Europe > Italy > Tuscany (0.04)
- (9 more...)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Education > Educational Setting > K-12 Education (0.45)
XSTEM: An exemplar-based stemming algorithm
Stemming is the process of reducing related words to a standard form by removing affixes from them. For example, eating, eats, and eaten can be reduced to the standard form eat by removing the -ing, -s, and -en from each. Stemming is a foundational step in many text processing pipelines, including information retrieval and language modeling [2], and as a feature reduction step for classification or lexical transfer learning tasks [1]. Most stemming algorithms are either rule-based or corpus-based. Rule-based stemmers use a set of manually created rules to transform words to their base form, typically in conjunction with reference to a dictionary for handling exceptions or modulating their output in some fashion. Corpus-based stemmers typically employ statistical machine learning methods to equate word forms based on distributional regularities derived from large amounts of text. Although researchers have demonstrated impressive results with corpus-based approaches (e.g., [2, and references therein]), rule-based stemming implementations […]

I was introduced to exemplar theory by Keith Johnson in a phonetics seminar at Ohio State. His model is called XMOD, which, as he notes, is the best name [7]. The name XSTEM comes from that.
- North America > United States > Ohio (0.24)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > Maldives (0.04)
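XSTEM itself is exemplar-based, but the -ing/-s/-en example in the passage is easy to render as the kind of naive rule-based suffix stripping it contrasts with; the suffix list and minimum-stem-length guard below are assumptions for illustration.

```python
def strip_suffix(word, suffixes=("ing", "en", "s")):
    """Naive rule-based stemming: strip the first matching suffix,
    longest first, leaving a stem of at least three letters."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word
```

This reduces eating, eats, and eaten to eat, exactly as in the opening example; real rule-based stemmers add many more rules plus a dictionary for exceptions.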
Silencing the Risk, Not the Whistle: A Semi-automated Text Sanitization Tool for Mitigating the Risk of Whistleblower Re-Identification
Staufer, Dimitri, Pallas, Frank, Berendt, Bettina
Whistleblowing is essential for ensuring transparency and accountability in both public and private sectors. However, (potential) whistleblowers often fear or face retaliation, even when reporting anonymously. The specific content of their disclosures and their distinct writing style may re-identify them as the source. Legal measures, such as the EU WBD, are limited in their scope and effectiveness. Therefore, computational methods to prevent re-identification are important complementary tools for encouraging whistleblowers to come forward. However, current text sanitization tools follow a one-size-fits-all approach and take an overly limited view of anonymity. They aim to mitigate identification risk by replacing typical high-risk words (such as person names and other NE labels) and combinations thereof with placeholders. Such an approach, however, is inadequate for the whistleblowing scenario since it neglects further re-identification potential in textual features, including writing style. Therefore, we propose, implement, and evaluate a novel classification and mitigation strategy for rewriting texts that involves the whistleblower in the assessment of the risk and utility. Our prototypical tool semi-automatically evaluates risk at the word/term level and applies risk-adapted anonymization techniques to produce a grammatically disjointed yet appropriately sanitized text. We then use a LLM that we fine-tuned for paraphrasing to render this text coherent and style-neutral. We evaluate our tool's effectiveness using court cases from the ECHR and excerpts from a real-world whistleblower testimony and measure the protection against authorship attribution (AA) attacks and utility loss statistically using the popular IMDb62 movie reviews dataset. Our method can significantly reduce AA accuracy from 98.81% to 31.22%, while preserving up to 73.1% of the original content's semantics.
- North America > United States (1.00)
- Europe > Austria > Vienna (0.14)
- North America > Montserrat (0.04)
- (12 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Government > Regional Government > North America Government > United States Government (0.93)
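The word/term-level, risk-adapted replacement step described above can be sketched as follows; the risk scores are assumed to come from an upstream classifier, and the threshold and placeholder are illustrative (the described tool additionally runs a fine-tuned paraphrasing LLM over the result to restore coherence).

```python
def sanitize(tokens, risk_scores, threshold=0.5, placeholder="[REDACTED]"):
    """Replace tokens whose estimated re-identification risk exceeds
    the threshold; everything else passes through unchanged."""
    return [
        placeholder if risk_scores.get(tok, 0.0) > threshold else tok
        for tok in tokens
    ]
```

Raising the threshold trades protection for utility, which is the risk/utility balance the tool lets the whistleblower assess interactively.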